While kings and queens of the past preferred emeralds, sapphires and rubies, the modern stone of glamour is the diamond. Diamonds are valued for their perceived rarity, for their timelessness, and for the stories that can be told about them.1 At the diamond dealer, we are told to look for value in the Four C’s – carat, cut, color, and clarity.2 To make sense of these factors and how they affect the price of a diamond, we analyzed data from the world’s leading online diamond dealer, Blue Nile. Prices ranging from $229 to over $2.3 million were best modeled by a multiple linear regression that uses all features available in the data set. The model accounts for 98% of price variation, and its predictions are more precise for lower-priced diamonds than for higher-priced ones. Large, expensive diamonds, which are usually natural rather than lab-grown, may have extra price variation due to immeasurable qualities like legacy, history, rarity, and glamour.
The Blue Nile Diamond data set contains the price, carat, cut, color, and clarity of over 210,000 diamonds for sale on their website at the time the data was gathered. Prices for these diamonds vary widely from $229 to over $2.3 million, and the average price is $5540. We wanted to investigate the effect that the Four C’s have on diamond prices and how well these characteristics alone determine the price. After all, there should be some value that this data does not account for like lab-grown versus natural, or the more ethereal quality of each diamond’s unique history. If we use this data set to create a price prediction model, how confident can we be that the model is correct for diamonds in general? And what is the estimated margin of error if we use it to predict the price of a diamond found at our local jeweler?
To answer these questions, we started by examining the relationship between price and each of carat, cut, color, and clarity alone. We found that carat, the physical weight of the diamond, had a strong influence on diamond price. Estimating price based on carat alone accounted for 56% of the variability in price between the diamonds in Blue Nile’s data. This makes sense, as much of the high price that diamonds carry is due to their limited supply, and larger stones are rarer. One interesting observation is that carat weights that are round numbers seem to be more common. Just as a customer buying a car might find it easier to stomach paying $4999 rather than $5000, even though the price is virtually the same, diamond buyers likely perceive a 2 carat diamond as much better than a 1.9 carat diamond. It appears that gemologists are sizing their diamonds with this marketing in mind.
Next, we investigated how a combination of the Four C’s related to price. With all these factors combined, we were able to account for 98% of the price variation observed in this data set. We also found that each diamond characteristic in the data is useful to keep in the model; none could be disregarded without a significant impact on the predictive ability of the model.
Finally, we checked to see if any interaction among diamond qualities had a significant effect on price.
To show the predictive power of the model we developed during this analysis, we took two diamonds of different price levels as examples. From the high price tier, we have a 9.09 carat diamond with VS2 clarity, I color, Ideal cut, and a price of $258,497. Then we looked at a lower priced $3851 stone weighing 1.1 carat with VS1 clarity, I color, Very Good cut. In our initial glance at this data, we noticed that prices for high priced, large carat stones vary a lot more than prices for less expensive, smaller stones. This is again evident from predicting prices for our two sample diamonds.
As downloaded from Blue Nile, our data set was clean with consistent formatting and no missing values. The data contained 210,638 rows in 5 columns. Two columns were numeric – price of the diamonds in dollars and carat weight. Three other categorical columns were present – color, clarity, and cut. These are industry standard ratings with alphanumeric designations.
Price is our response variable of interest. Diamond price values ranged from $229 to over $2.3 million with an average of $5540, a median of $1432, and a standard deviation of $22,898. This data does not appear to be normally distributed. Our mean is much smaller than the standard deviation, indicating that the data are spread away from the mean. We can also tell by the median and mean that our data has a positive skew with a long tail on the right. Two histograms of price are shown below – the first with linear scale for the price axis and the second with logarithmic scale for the price axis. Plotting price on a logarithmic scale is necessary to see the data clearly, and we repeat this procedure in all the graphs to follow.
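As a rough illustration of why the log scale helps, the sketch below draws synthetic right-skewed prices from a log-normal distribution (an assumption made for illustration; this is not the Blue Nile data) and compares the two histogram scales. The name `price_sim` and the distribution parameters are hypothetical.

```r
# Synthetic, right-skewed "prices" (log-normal); NOT the Blue Nile data.
set.seed(1)
price_sim <- rlnorm(10000, meanlog = 7.3, sdlog = 1.2)

# A mean well above the median signals a long right tail.
c(mean = mean(price_sim), median = median(price_sim), sd = sd(price_sim))

# Histogram on the raw scale, then on a log10 scale.
hist(price_sim, breaks = 50, main = "Price (linear scale)", xlab = "Price ($)")
hist(log10(price_sim), breaks = 50, main = "Price (log10 scale)",
     xlab = "log10(Price)")
```

On the linear scale nearly all observations collapse into the first few bins, while the log10 histogram spreads them out enough to see the shape of the distribution.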
The carat unit of measure describes the weight of a diamond with extreme precision. One carat is 0.2 grams, roughly the weight of a single paperclip, and diamonds are usually weighed to the one-thousandth of one carat.3 This precision is extremely important because a diamond’s price per carat is typically very high, and errors in measurement can have a high cost. Our data appears to be rounded to the nearest hundredth of a carat. A quick scatterplot of price versus carat weight appears to show a strong positive relationship between the two. We also calculated a high, positive correlation coefficient of 0.752.
Color rates a diamond’s hue on a scale from colorless to light yellow. Designations on this scale are alphabetically ordered letters from D (colorless) to Z (light yellow). The figure below shows what diamonds on this scale look like. Ratings closer to D (colorless) are generally preferred by buyers. The ideal, structurally perfect diamond crystal would be a colorless stone, and more color can give the diamond a cloudy appearance in certain light.
Source: GIA [https://www.gia.edu/images/19146.jpg]
In a boxplot of color and price, we can see that some of the highest prices are observed in the D (colorless) rating. However, there does not seem to be a clear increase or decrease in mean price as color rating improves.
A flawless lab-grown diamond crystal would be perfectly clear, but diamonds formed in nature and imperfect lab conditions contain imperfections. Inclusions are the name given to imperfections within the diamond, and flaws on the surface are called blemishes.4 These imperfections bend light reflecting through the diamond and decrease its clarity. Clarity is measured on a scale of 11 ratings from FL (Flawless) to I3 (Included). Stones with fewer imperfections are preferred as they will have more brilliance in the light.
Source: GIA [https://www.gia.edu/images/GIAClarityScale_2014_636x200.jpg]
In a boxplot of clarity and price, the FL (Flawless) category fetched the highest maximum and average price. However, there does not seem to be a clear increase or decrease in mean price across the clarity scale.
A diamond’s brilliance and glimmer is a result of its intricate cut. Gemologists study the behavior of light reflecting through a prism to make cuts that direct light in the right direction for best appearance. The Cut vs Price boxplot below shows that the best Astor Ideal cut rating corresponds to the highest mean price. Counter to our expectations though, the max price in each category seems to decrease with better cut. As with the other two categorical predictors, the relationship here is not clear or strong.
The most significant linear relationship we see is between price and carat. Intuitively, this makes sense, from common knowledge that the larger the carat, the more expensive the diamond. We want to test this relationship to see if a simple linear regression is appropriate in approximating this relationship. We will look for \(\hat\beta\) parameters for the simple linear regression equation below.
\[ y = \hat\beta_0 + \hat\beta_1x \]
Below is a scatterplot for the simple linear regression of Price against Carat.
##
## Call:
## lm(formula = price ~ carat, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -95743 -4363 1799 4727 1960718
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12788.67 48.08 -266.0 <2e-16 ***
## carat 24051.21 45.99 522.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15100 on 210636 degrees of freedom
## Multiple R-squared: 0.5649, Adjusted R-squared: 0.5649
## F-statistic: 2.735e+05 on 1 and 210636 DF, p-value: < 2.2e-16
The regression equation for the simple linear regression of Price against Carat is: \[y = -12788.67 + 24051.21x.\] The \(R^{2}\) value is: \[R^{2} = 0.5649.\] The t statistic and p-value for this linear regression is: \[t = 522.9 \text{ and } p<2e-16.\] The F statistic and p-value for this linear regression is: \[F = 2.735e+05 \text{ and } p<2.2e-16.\] This tells us that for every unit increase in carat, price increases by about $24,051. The \(R^{2}\) value of 0.5649 tells us that 56.49% of the variance in the data can be explained by the model. A significant t test and F test tell us that carat is significant in predicting the price of a diamond, and the model is fairly useful in predicting price.
We will assess the assumptions for linear regression. Below is a residual plot of the SLR model, as well as an ACF plot and a QQ plot of the residuals. The scatterplot looks like it follows an exponential curve. There also seems to be a funnel pattern to the residual plot. The QQ norm plot also follows a curve that resembles an exponential function. Since non-linearity is not the only problem (the funnel pattern shows that the residuals do not have constant variance), we will transform the response variable first.
Using a Box-Cox plot, we find that the ideal lambda for transformation is around 0.33. We will first try to transform the response variable using a cube root based on this lambda.
library(MASS)
boxcox(result.car, lambda = seq(0.3,0.4,0.01))
diamonds$cube_root_price <- diamonds$price^(1/3) # Transform responses
attach(diamonds)
## The following objects are masked from diamonds (pos = 4):
##
## carat, clarity, color, cut, price
result.transformed <- lm(cube_root_price~carat) # Fit new linear regression
summary(result.transformed) # Summarize results
##
## Call:
## lm(formula = cube_root_price ~ carat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -104.789 -1.151 -0.380 0.758 34.350
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.494582 0.007312 888.2 <2e-16 ***
## carat 9.311244 0.006995 1331.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.297 on 210636 degrees of freedom
## Multiple R-squared: 0.8938, Adjusted R-squared: 0.8938
## F-statistic: 1.772e+06 on 1 and 210636 DF, p-value: < 2.2e-16
plot(carat, cube_root_price, xlab='Carat', ylab='Cube Root of Price', main='Cube Root of Price vs Carat') # Scatterplot of transformed data and new model
abline(result.transformed) # Linear regression line
plot(result.transformed$fitted.values, result.transformed$residuals, xlab='Fitted Values', ylab='Residuals', main='Residual Plot') # Residual plot of model
abline(h=0)
acf(result.transformed$residuals) # ACF plot of residuals
qqnorm(result.transformed$residuals) # QQ plot of residuals
qqline(result.transformed$residuals)
The scatterplot and regression assumptions are slightly improved, but still not great, so we use another transformation, this time on the predictor variable. The scatterplot below shows the relationship between the cube root of the response and the square root of the predictor.
diamonds$sqrt_carat <- diamonds$carat^(1/2) # Transform predictor
attach(diamonds)
## The following objects are masked from diamonds (pos = 3):
##
## carat, clarity, color, color_cat, cube_root_price, cut, price
## The following objects are masked from diamonds (pos = 5):
##
## carat, clarity, color, cut, price
result.transformed2 <- lm(cube_root_price~sqrt_carat) # Fit new linear regression
summary(result.transformed2)
##
## Call:
## lm(formula = cube_root_price ~ sqrt_carat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.310 -0.767 0.018 0.668 48.227
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.05897 0.01143 -442.7 <2e-16 ***
## sqrt_carat 22.74445 0.01309 1737.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.8 on 210636 degrees of freedom
## Multiple R-squared: 0.9348, Adjusted R-squared: 0.9348
## F-statistic: 3.018e+06 on 1 and 210636 DF, p-value: < 2.2e-16
plot(sqrt_carat, cube_root_price, xlab='Sqrt Carat', ylab='Cube Root of Price', main='Cube Root of Price vs Sqrt of Carat') # Scatterplot of transformed data and new model
abline(result.transformed2) # Linear regression line
plot(result.transformed2$fitted.values, result.transformed2$residuals, xlab='Fitted Values', ylab='Residuals', main='Residual Plot') # Residual plot of model
abline(h=0)
acf(result.transformed2$residuals) # ACF plot of residuals
qqnorm(result.transformed2$residuals) # QQ plot of residuals
qqline(result.transformed2$residuals)
The scatterplot and regression assumptions are again slightly better, but still not satisfactory. The residual plot still follows a funnel pattern and the QQ norm plot is still curved, especially for carats greater than 2.
We refit the regression model again with a transformation on both the response and predictor. Since the Box Cox interval is close to 0 and the original SLR regression line follows an exponential curve, we attempt a log-log transformation.
##
## Call:
## lm(formula = price.log ~ carat.log, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.19296 -0.17193 -0.00889 0.15051 1.82515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.5078289 0.0007599 11196 <2e-16 ***
## carat.log 1.9124150 0.0009547 2003 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2707 on 210636 degrees of freedom
## Multiple R-squared: 0.9501, Adjusted R-squared: 0.9501
## F-statistic: 4.013e+06 on 1 and 210636 DF, p-value: < 2.2e-16
Below is the residual plot, ACF plot and QQ norm plot of residuals to evaluate the assumptions for linear regression for our new log-log transformed model. Finally, the assumptions for linear regression are met, \(R^{2}\) is reasonable, and the scatterplot confirms a positive linear relationship. Using the coefficients from the summary above, the new regression equation is \[\log(y) = 8.5078 + 1.9124\log(x)\] \[y = e^{8.5078}e^{1.9124\log(x)}\] \[ = e^{8.5078}x^{1.9124}\] Also notice, \[R^{2} = 0.9501\]
The \(R^{2}\) indicates a much stronger linear relationship, as does the scatterplot. Thus, we will use this log-log regression equation for our analysis.
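A minimal sketch of how the log-log fit can be turned back into a dollar prediction, using the intercept and slope reported in the `price.log ~ carat.log` summary above. The helper name `predict_price` is hypothetical and not part of the original analysis.

```r
# Coefficients copied from the log-log summary output above.
b0 <- 8.5078289   # intercept on the log(price) scale
b1 <- 1.9124150   # slope on log(carat)

# Hypothetical helper: back-transform a log-log prediction to dollars,
# i.e. price = exp(b0) * carat^b1.
predict_price <- function(carat) exp(b0 + b1 * log(carat))

# The two example stones discussed in this report.
predict_price(c(1.1, 9.09))
```

Note that exponentiating a prediction made on the log scale estimates a typical (median) price rather than the mean price, assuming the errors are roughly normal on the log scale.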
Interpretation of slopes after a log transform of predictor and the response is as follows: When the predictor is multiplied by a factor of \(a\), the response increases by \((a^{\beta_1} - 1)*100\)%. This is an increase of approximately \(\beta_1\)% in the response for every 1% increase in the predictor.
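The interpretation above can be checked numerically. The sketch below plugs the slope reported for the log-log fit into the formula \((a^{\beta_1} - 1) \times 100\); the helper name `pct_change` is hypothetical.

```r
b1 <- 1.9124150  # slope on log(carat) from the log-log fit reported above

# Hypothetical helper: percent change in price when carat is multiplied by a.
pct_change <- function(a, beta = b1) (a^beta - 1) * 100

pct_change(1.01)  # a 1% increase in carat: close to beta percent
pct_change(1.10)  # a 10% increase in carat
pct_change(2)     # doubling the carat weight
```

The "approximately \(\beta_1\)%" rule of thumb is accurate only for small changes in carat; for larger multipliers the exact formula should be used.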
Here we explore the relationship between price and carat split out by other predictors. First we use cut as our categorical variable.
## Ideal Very Good Good
## Astor Ideal 0 0 0
## Ideal 1 0 0
## Very Good 0 1 0
## Good 0 0 1
The scatterplot shows that there is a positive linear relationship between carat and price for all four cut types when transformed using a log-log transformation. Good, Very Good, and Ideal have similar slopes, while Astor Ideal has a slope closer to zero. This could be explained by the large concentration of Astor Ideal points being low in carats.
We now categorize color into two categories based on the categories given by the diamond quality factors research from the Gemological Institute of America. D, E and F are “Colorless” and G, H, I and J are “Near Colorless.” This categorization technique makes it easier to see the effects of color on our simple linear regression. There is a similar positive relationship between colorless and near colorless color categories. The slopes for colorless and near colorless are very similar, with near colorless being slightly closer to zero. These similar slopes indicate that there may be little interaction between carat and color as predictors in the same model.
We use confidence intervals to estimate the mean price, and prediction intervals to estimate the price of an individual diamond, when carat=1.1 (fairly low carats) and carat=9.09 (fairly high carats). These intervals are made using the model with no transformations. For each carat value, the output lists the confidence interval first and the prediction interval second.
## fit lwr upr
## 1 13667.66 13596.33 13739
## fit lwr upr
## 1 13667.66 -15936.71 43272.04
## fit lwr upr
## 1 205836.8 205083.3 206590.3
## fit lwr upr
## 1 205836.8 176223 235450.7
The results are as follows:
Actual price for 1.1 carat: $3,851
Predicted mean price for 1.1 carat: $13,667.66
95% confidence interval: (13596.33, 13739)
95% prediction interval: (-15936.71, 43272.04)

Actual price for 9.09 carat: $258,497
Predicted mean price for 9.09 carat: $205,836.8
95% confidence interval: (205083.3, 206590.3)
95% prediction interval: (176223, 235450.7)
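The intervals above come from the untransformed model, which is why the prediction interval for the 1.1 carat stone has a negative lower bound. As a sketch of one alternative (not the procedure used in this report), the intervals could be computed on the log-log scale and exponentiated. The data below are synthetic, and the coefficients and noise level used to simulate them are illustrative assumptions.

```r
# Synthetic data that loosely mimics a log-log price/carat relationship;
# NOT the Blue Nile data.
set.seed(42)
n <- 5000
sim <- data.frame(carat = runif(n, 0.25, 10))
sim$price <- exp(8.5 + 1.9 * log(sim$carat) + rnorm(n, sd = 0.27))

fit_log <- lm(log(price) ~ log(carat), data = sim)
new_stones <- data.frame(carat = c(1.1, 9.09))

# Intervals are computed on the log scale, then exponentiated back to dollars.
conf_int <- exp(predict(fit_log, new_stones, interval = "confidence", level = 0.95))
pred_int <- exp(predict(fit_log, new_stones, interval = "prediction", level = 0.95))

conf_int  # interval for the typical price at each carat weight
pred_int  # interval for the price of one individual stone
```

Because the bounds are exponentiated, they can never be negative, and the prediction interval widens in dollar terms as the predicted price grows.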
To determine how well all of the predictors available to us can account for diamond price variation, we fit a full linear regression model to the data. During simple linear regression, we confirmed that a log transformation of price and carat produced the best linear model, so we also used that transformation here. The model shows low standard errors, and the p-values indicate that every coefficient is significant given the other predictors in the model. The \(R^2\) value indicates that this model accounts for 98.1% of the variation in our data. This is an increase of 41.6 percentage points over the \(R^2\) of 56.5% from our untransformed Simple Linear Regression.
##
## Call:
## lm(formula = log(price) ~ log(carat) + clarity + cut + color,
## data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.95230 -0.10643 -0.00626 0.10309 1.24026
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.3499089 0.0060069 1556.53 <2e-16 ***
## log(carat) 1.9654032 0.0006011 3269.73 <2e-16 ***
## clarityIF -0.2602883 0.0055576 -46.83 <2e-16 ***
## clarityVVS1 -0.6385266 0.0053676 -118.96 <2e-16 ***
## clarityVVS2 -0.7748139 0.0054166 -143.04 <2e-16 ***
## clarityVS1 -0.4691355 0.0053727 -87.32 <2e-16 ***
## clarityVS2 -0.5272827 0.0053771 -98.06 <2e-16 ***
## claritySI1 -0.3494636 0.0053965 -64.76 <2e-16 ***
## claritySI2 -0.4200952 0.0053925 -77.90 <2e-16 ***
## cutIdeal -0.2986523 0.0031754 -94.05 <2e-16 ***
## cutVery Good -0.0822843 0.0028952 -28.42 <2e-16 ***
## cutGood -0.2527184 0.0029324 -86.18 <2e-16 ***
## colorE -0.0622330 0.0012591 -49.43 <2e-16 ***
## colorF -0.0947113 0.0012528 -75.60 <2e-16 ***
## colorG -0.1534624 0.0012758 -120.29 <2e-16 ***
## colorH -0.2166832 0.0013534 -160.10 <2e-16 ***
## colorI -0.3172295 0.0013738 -230.92 <2e-16 ***
## colorJ -0.4458935 0.0016000 -278.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1656 on 210620 degrees of freedom
## Multiple R-squared: 0.9813, Adjusted R-squared: 0.9813
## F-statistic: 6.517e+05 on 17 and 210620 DF, p-value: < 2.2e-16
Thus far, there has been no evidence of multicollinearity, but we will also verify this with variance inflation factors (VIF). All of the VIF values in the table below are close to 1, so we conclude that there is no multicollinearity.
## Loading required package: carData
## GVIF Df GVIF^(1/(2*Df))
## log(carat) 1.059753 1 1.029443
## clarity 1.041533 7 1.002911
## cut 1.057626 3 1.009382
## color 1.037056 6 1.003037
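For a term with a single degree of freedom, such as log(carat), the GVIF reduces to the ordinary VIF, \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing that predictor on all the others. The sketch below computes it by hand in base R on synthetic stand-in data; the factor levels are drawn independently of carat, which is an assumption made only for illustration.

```r
# Synthetic stand-in data; NOT the Blue Nile data.
set.seed(7)
n <- 2000
sim <- data.frame(
  carat = runif(n, 0.25, 5),
  cut   = factor(sample(c("Astor Ideal", "Ideal", "Very Good", "Good"),
                        n, replace = TRUE)),
  color = factor(sample(c("D", "E", "F", "G", "H", "I", "J"),
                        n, replace = TRUE))
)

# Regress the numeric predictor on the remaining predictors,
# then apply VIF = 1 / (1 - R^2).
aux <- lm(log(carat) ~ cut + color, data = sim)
vif_log_carat <- 1 / (1 - summary(aux)$r.squared)
vif_log_carat
```

A value near 1 means the other predictors explain almost none of the variation in log(carat), so its coefficient's variance is barely inflated.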
Similar results to the Simple Linear Regression model were found when verifying the regression assumptions for this Multiple Linear Regression. After our log transformations of price and carat, the residual plot shows constant variance and constant mean. An ACF plot indicates that our residuals are independent. And a QQ-plot shows that the residuals are more or less normally distributed. We are satisfied with this model and no more transformations are needed.
Residual plot
ACF plot of residuals
QQ plot of residuals
Next we attempt to reduce our model by removing predictors one by one. We then use a partial F test where we fit and compare the full and reduced regression models to determine if any predictors can be dropped from the model. Here we are testing the null hypothesis that all dropped predictors have a slope of 0 against the alternative hypothesis that not all the dropped predictors have a slope of 0. \[ H_0: \text{all removed }\beta = 0, H_a: \text{not all removed }\beta = 0 \]
After each round of model reduction, we run an analysis of variance between the full model and the reduced model. The F-value from ANOVA is calculated using the equation below, where \(MS_R\) is the extra sum of squares explained by the \(r\) removed coefficients divided by \(r\), and \(MS_{Res}\) is the residual mean square of the full model. This value is then compared to an F distribution to obtain a p-value. \[ F_0 = \frac{MS_R}{MS_{Res}} = \frac{[SS_{Res}(\text{reduced}) - SS_{Res}(\text{full})]/r}{MS_{Res}(\text{full})} \] Three reduced model anova comparisons are shown below: first for removing color, then cut, then clarity.
The low p-values for each of these tests indicate that we can reject the null hypothesis each time – the removed predictor was useful in our model. Because all our predictors are useful, we will go with the full regression model instead of the reduced model.
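A minimal sketch of a partial F test in R, assuming synthetic data with a made-up color effect (the effect sizes, the three color levels, and the object names are illustrative assumptions, not values from the Blue Nile data). It compares a full and a reduced model with `anova()` and then reproduces the F statistic by hand from the residual sums of squares.

```r
# Synthetic data; NOT the Blue Nile data.
set.seed(123)
n <- 3000
sim <- data.frame(
  carat = runif(n, 0.25, 5),
  color = factor(sample(c("D", "G", "J"), n, replace = TRUE))
)
color_effect <- c(D = 0, G = -0.15, J = -0.45)
sim$price <- exp(8.5 + 1.9 * log(sim$carat) +
                 unname(color_effect[as.character(sim$color)]) +
                 rnorm(n, sd = 0.17))

full    <- lm(log(price) ~ log(carat) + color, data = sim)
reduced <- lm(log(price) ~ log(carat), data = sim)

# Partial F test: H0 says the dropped color coefficients are all 0.
cmp <- anova(reduced, full)
cmp

# The same F statistic computed by hand from the residual sums of squares.
r <- df.residual(reduced) - df.residual(full)   # number of dropped coefficients
f_manual <- ((deviance(reduced) - deviance(full)) / r) /
            (deviance(full) / df.residual(full))
f_manual
```

For `lm` objects, `deviance()` returns the residual sum of squares, so `f_manual` should match the F value in the second row of the `anova()` table.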
Since we have already investigated multiple linear regression models without interaction, we may consider the possibility of interaction between predictors. First we will consider interaction between the transformed carat predictor and each categorical variable. From an understanding of diamonds and pricing, it is reasonable to intuit that there will be meaningful interaction between levels of the categorical predictors, but we will also test this intuition rigorously.
##
## Call:
## lm(formula = log(price) ~ log(carat) * cut, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.33893 -0.16486 -0.01298 0.14686 1.65222
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.638178 0.005780 1494.509 < 2e-16 ***
## log(carat) 1.836689 0.008486 216.444 < 2e-16 ***
## cutIdeal -0.310010 0.006259 -49.529 < 2e-16 ***
## cutVery Good -0.036061 0.005867 -6.146 7.95e-10 ***
## cutGood -0.227155 0.005892 -38.552 < 2e-16 ***
## log(carat):cutIdeal 0.055547 0.009229 6.019 1.76e-09 ***
## log(carat):cutVery Good 0.124528 0.008569 14.533 < 2e-16 ***
## log(carat):cutGood 0.084844 0.008634 9.827 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2543 on 210630 degrees of freedom
## Multiple R-squared: 0.956, Adjusted R-squared: 0.956
## F-statistic: 6.541e+05 on 7 and 210630 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = log(price) ~ log(carat) * color_cat, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.04326 -0.15818 -0.00078 0.14813 1.59136
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.640470 0.001034 8356.27 <2e-16 ***
## log(carat) 1.975188 0.001272 1553.37 <2e-16 ***
## color_catNear Colorless -0.249673 0.001408 -177.36 <2e-16 ***
## log(carat):color_catNear Colorless -0.097489 0.001768 -55.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2498 on 210634 degrees of freedom
## Multiple R-squared: 0.9576, Adjusted R-squared: 0.9575
## F-statistic: 1.584e+06 on 3 and 210634 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = log(price) ~ log(carat) * clarity, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.14364 -0.13057 0.00538 0.13094 1.35743
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.224942 0.007723 1194.515 < 2e-16 ***
## log(carat) 1.985840 0.008158 243.431 < 2e-16 ***
## clarityIF -0.391652 0.008308 -47.140 < 2e-16 ***
## clarityVVS1 -0.844276 0.007847 -107.590 < 2e-16 ***
## clarityVVS2 -1.023456 0.007931 -129.045 < 2e-16 ***
## clarityVS1 -0.657286 0.007852 -83.711 < 2e-16 ***
## clarityVS2 -0.725124 0.007854 -92.322 < 2e-16 ***
## claritySI1 -0.528204 0.007943 -66.495 < 2e-16 ***
## claritySI2 -0.614084 0.007914 -77.597 < 2e-16 ***
## log(carat):clarityIF 0.014858 0.008874 1.674 0.094056 .
## log(carat):clarityVVS1 -0.087092 0.008359 -10.419 < 2e-16 ***
## log(carat):clarityVVS2 -0.117527 0.008486 -13.850 < 2e-16 ***
## log(carat):clarityVS1 -0.053048 0.008352 -6.351 2.14e-10 ***
## log(carat):clarityVS2 -0.072607 0.008363 -8.682 < 2e-16 ***
## log(carat):claritySI1 -0.027974 0.008454 -3.309 0.000936 ***
## log(carat):claritySI2 -0.047785 0.008429 -5.669 1.44e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2226 on 210622 degrees of freedom
## Multiple R-squared: 0.9663, Adjusted R-squared: 0.9663
## F-statistic: 4.025e+05 on 15 and 210622 DF, p-value: < 2.2e-16
Each of these models appears to be reasonably well-fitted, and the residual plots appear sufficient by inspection. Upon closer inspection of the summaries, we see that not every interaction term has a significant t-test p-value (for example, log(carat):clarityIF has p = 0.094), indicating that some terms perhaps do not meaningfully contribute to the model, and that more investigation will be needed.
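To read the interaction coefficients, it can help to convert them into a slope for each level. In the `log(carat) * cut` summary above, Astor Ideal is the reference level, so its slope is the `log(carat)` coefficient, and every other cut adds its interaction term to that baseline. The sketch below does this arithmetic with the reported estimates; the variable names are hypothetical.

```r
# Coefficients copied from the log(carat) * cut summary above.
base_slope  <- 1.836689   # log(carat): slope for the Astor Ideal reference level
slope_shift <- c(Ideal = 0.055547, `Very Good` = 0.124528, Good = 0.084844)

# Slope on log(carat) within each cut level.
slopes <- c(`Astor Ideal` = base_slope, base_slope + slope_shift)
slopes
```

The resulting slopes agree with the earlier scatterplot: Astor Ideal has the smallest slope, and the other three cuts are close to one another.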
It is perhaps a natural question, then, to ask what would happen if we considered a model with interaction between all possible predictors? We will call this the “full interaction model”.
##
## Call:
## lm(formula = log(price) ~ log(carat) * cut * color_cat * clarity,
## data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.98400 -0.10817 -0.00067 0.10648 1.06432
##
## Coefficients: (5 not defined because of singularities)
## Estimate
## (Intercept) 9.375e+00
## log(carat) 2.126e+00
## cutIdeal -6.839e-01
## cutVery Good -4.211e-02
## cutGood -4.013e-01
## color_catNear Colorless -5.924e-01
## clarityIF 1.058e-02
## clarityVVS1 -8.081e-01
## clarityVVS2 -9.677e-01
## clarityVS1 -5.072e-01
## clarityVS2 -6.696e-01
## claritySI1 -2.270e-01
## claritySI2 -4.057e-01
## log(carat):cutIdeal 2.105e-01
## log(carat):cutVery Good -7.139e-02
## log(carat):cutGood 4.101e-02
## log(carat):color_catNear Colorless -2.017e-01
## cutIdeal:color_catNear Colorless 1.462e-01
## cutVery Good:color_catNear Colorless 8.390e-02
## cutGood:color_catNear Colorless 1.443e-01
## log(carat):clarityIF 2.702e-01
## log(carat):clarityVVS1 -2.272e-01
## log(carat):clarityVVS2 -4.221e-01
## log(carat):clarityVS1 -1.320e-01
## log(carat):clarityVS2 -2.535e-01
## log(carat):claritySI1 -5.936e-02
## log(carat):claritySI2 -1.405e-01
## cutIdeal:clarityIF 1.498e-01
## cutVery Good:clarityIF -2.215e-01
## cutGood:clarityIF -1.275e-01
## cutIdeal:clarityVVS1 4.411e-01
## cutVery Good:clarityVVS1 4.152e-02
## cutGood:clarityVVS1 2.080e-01
## cutIdeal:clarityVVS2 4.382e-01
## cutVery Good:clarityVVS2 1.463e-02
## cutGood:clarityVVS2 2.021e-01
## cutIdeal:clarityVS1 3.417e-01
## cutVery Good:clarityVS1 -2.630e-02
## cutGood:clarityVS1 1.273e-01
## cutIdeal:clarityVS2 4.399e-01
## cutVery Good:clarityVS2 4.510e-02
## cutGood:clarityVS2 2.139e-01
## cutIdeal:claritySI1 1.684e-01
## cutVery Good:claritySI1 -1.150e-01
## cutGood:claritySI1 -2.703e-02
## cutIdeal:claritySI2 2.804e-01
## cutVery Good:claritySI2 -5.697e-02
## cutGood:claritySI2 7.756e-02
## color_catNear Colorless:clarityIF -1.783e-01
## color_catNear Colorless:clarityVVS1 4.653e-01
## color_catNear Colorless:clarityVVS2 3.426e-01
## color_catNear Colorless:clarityVS1 2.636e-01
## color_catNear Colorless:clarityVS2 4.104e-01
## color_catNear Colorless:claritySI1 2.841e-02
## color_catNear Colorless:claritySI2 1.853e-01
## log(carat):cutIdeal:color_catNear Colorless 5.551e-02
## log(carat):cutVery Good:color_catNear Colorless 8.294e-02
## log(carat):cutGood:color_catNear Colorless 5.900e-02
## log(carat):cutIdeal:clarityIF -4.177e-01
## log(carat):cutVery Good:clarityIF -2.723e-01
## log(carat):cutGood:clarityIF -3.656e-01
## log(carat):cutIdeal:clarityVVS1 -2.063e-01
## log(carat):cutVery Good:clarityVVS1 1.853e-01
## log(carat):cutGood:clarityVVS1 -1.472e-02
## log(carat):cutIdeal:clarityVVS2 -7.592e-02
## log(carat):cutVery Good:clarityVVS2 3.581e-01
## log(carat):cutGood:clarityVVS2 1.318e-01
## log(carat):cutIdeal:clarityVS1 -1.955e-01
## log(carat):cutVery Good:clarityVS1 1.177e-01
## log(carat):cutGood:clarityVS1 -4.706e-05
## log(carat):cutIdeal:clarityVS2 -9.235e-02
## log(carat):cutVery Good:clarityVS2 2.269e-01
## log(carat):cutGood:clarityVS2 6.807e-02
## log(carat):cutIdeal:claritySI1 -2.759e-01
## log(carat):cutVery Good:claritySI1 2.941e-02
## log(carat):cutGood:claritySI1 -5.284e-02
## log(carat):cutIdeal:claritySI2 -1.991e-01
## log(carat):cutVery Good:claritySI2 1.048e-01
## log(carat):cutGood:claritySI2 NA
## log(carat):color_catNear Colorless:clarityIF -4.745e-01
## log(carat):color_catNear Colorless:clarityVVS1 1.367e-01
## log(carat):color_catNear Colorless:clarityVVS2 2.091e-01
## log(carat):color_catNear Colorless:clarityVS1 -1.777e-02
## log(carat):color_catNear Colorless:clarityVS2 1.485e-01
## log(carat):color_catNear Colorless:claritySI1 -1.100e-01
## log(carat):color_catNear Colorless:claritySI2 1.313e-02
## cutIdeal:color_catNear Colorless:clarityIF 1.263e-01
## cutVery Good:color_catNear Colorless:clarityIF 1.957e-01
## cutGood:color_catNear Colorless:clarityIF 2.087e-01
## cutIdeal:color_catNear Colorless:clarityVVS1 -1.653e-01
## cutVery Good:color_catNear Colorless:clarityVVS1 -1.320e-01
## cutGood:color_catNear Colorless:clarityVVS1 -1.640e-01
## cutIdeal:color_catNear Colorless:clarityVVS2 -1.233e-02
## cutVery Good:color_catNear Colorless:clarityVVS2 1.715e-02
## cutGood:color_catNear Colorless:clarityVVS2 -1.297e-02
## cutIdeal:color_catNear Colorless:clarityVS1 -5.518e-02
## cutVery Good:color_catNear Colorless:clarityVS1 -1.542e-02
## cutGood:color_catNear Colorless:clarityVS1 -4.537e-02
## cutIdeal:color_catNear Colorless:clarityVS2 -1.650e-01
## cutVery Good:color_catNear Colorless:clarityVS2 -1.186e-01
## cutGood:color_catNear Colorless:clarityVS2 -1.611e-01
## cutIdeal:color_catNear Colorless:claritySI1 8.996e-02
## cutVery Good:color_catNear Colorless:claritySI1 1.065e-01
## cutGood:color_catNear Colorless:claritySI1 1.006e-01
## cutIdeal:color_catNear Colorless:claritySI2 NA
## cutVery Good:color_catNear Colorless:claritySI2 3.460e-03
## cutGood:color_catNear Colorless:claritySI2 NA
## log(carat):cutIdeal:color_catNear Colorless:clarityIF 2.310e-01
## log(carat):cutVery Good:color_catNear Colorless:clarityIF 4.760e-01
## log(carat):cutGood:color_catNear Colorless:clarityIF 4.220e-01
## log(carat):cutIdeal:color_catNear Colorless:clarityVVS1 -6.444e-02
## log(carat):cutVery Good:color_catNear Colorless:clarityVVS1 -9.433e-02
## log(carat):cutGood:color_catNear Colorless:clarityVVS1 -6.134e-02
## log(carat):cutIdeal:color_catNear Colorless:clarityVVS2 -7.750e-02
## log(carat):cutVery Good:color_catNear Colorless:clarityVVS2 -1.642e-01
## log(carat):cutGood:color_catNear Colorless:clarityVVS2 -1.059e-01
## log(carat):cutIdeal:color_catNear Colorless:clarityVS1 1.954e-02
## log(carat):cutVery Good:color_catNear Colorless:clarityVS1 4.643e-02
## log(carat):cutGood:color_catNear Colorless:clarityVS1 3.631e-02
## log(carat):cutIdeal:color_catNear Colorless:clarityVS2 -1.174e-01
## log(carat):cutVery Good:color_catNear Colorless:clarityVS2 -1.144e-01
## log(carat):cutGood:color_catNear Colorless:clarityVS2 -1.079e-01
## log(carat):cutIdeal:color_catNear Colorless:claritySI1 1.029e-01
## log(carat):cutVery Good:color_catNear Colorless:claritySI1 1.813e-01
## log(carat):cutGood:color_catNear Colorless:claritySI1 8.963e-02
## log(carat):cutIdeal:color_catNear Colorless:claritySI2 NA
## log(carat):cutVery Good:color_catNear Colorless:claritySI2 3.228e-02
## log(carat):cutGood:color_catNear Colorless:claritySI2 NA
## Std. Error t value
## (Intercept) 1.798e-01 52.131
## log(carat) 2.908e-02 73.103
## cutIdeal 2.105e-01 -3.249
## cutVery Good 1.800e-01 -0.234
## cutGood 1.790e-01 -2.242
## color_catNear Colorless 4.503e-02 -13.155
## clarityIF 1.893e-01 0.056
## clarityVVS1 1.804e-01 -4.481
## clarityVVS2 1.905e-01 -5.080
## clarityVS1 1.803e-01 -2.814
## clarityVS2 1.802e-01 -3.716
## claritySI1 1.807e-01 -1.256
## claritySI2 1.789e-01 -2.268
## log(carat):cutIdeal 9.605e-02 2.192
## log(carat):cutVery Good 3.015e-02 -2.368
## log(carat):cutGood 2.208e-02 1.857
## log(carat):color_catNear Colorless 6.295e-02 -3.204
## cutIdeal:color_catNear Colorless 2.431e-02 6.013
## cutVery Good:color_catNear Colorless 5.718e-02 1.467
## cutGood:color_catNear Colorless 2.265e-02 6.368
## log(carat):clarityIF 1.019e-01 2.651
## log(carat):clarityVVS1 3.630e-02 -6.258
## log(carat):clarityVVS2 9.758e-02 -4.326
## log(carat):clarityVS1 3.363e-02 -3.927
## log(carat):clarityVS2 3.348e-02 -7.572
## log(carat):claritySI1 3.735e-02 -1.589
## log(carat):claritySI2 1.942e-02 -7.233
## cutIdeal:clarityIF 2.190e-01 0.684
## cutVery Good:clarityIF 1.895e-01 -1.169
## cutGood:clarityIF 1.886e-01 -0.676
## cutIdeal:clarityVVS1 2.110e-01 2.091
## cutVery Good:clarityVVS1 1.805e-01 0.230
## cutGood:clarityVVS1 1.795e-01 1.159
## cutIdeal:clarityVVS2 2.198e-01 1.994
## cutVery Good:clarityVVS2 1.907e-01 0.077
## cutGood:clarityVVS2 1.897e-01 1.065
## cutIdeal:clarityVS1 2.109e-01 1.620
## cutVery Good:clarityVS1 1.804e-01 -0.146
## cutGood:clarityVS1 1.794e-01 0.709
## cutIdeal:clarityVS2 2.109e-01 2.086
## cutVery Good:clarityVS2 1.804e-01 0.250
## cutGood:clarityVS2 1.794e-01 1.192
## cutIdeal:claritySI1 2.114e-01 0.797
## cutVery Good:claritySI1 1.808e-01 -0.636
## cutGood:claritySI1 1.798e-01 -0.150
## cutIdeal:claritySI2 2.098e-01 1.337
## cutVery Good:claritySI2 1.791e-01 -0.318
## cutGood:claritySI2 1.780e-01 0.436
## color_catNear Colorless:clarityIF 8.210e-02 -2.172
## color_catNear Colorless:clarityVVS1 4.887e-02 9.520
## color_catNear Colorless:clarityVVS2 9.364e-02 3.659
## color_catNear Colorless:clarityVS1 4.772e-02 5.524
## color_catNear Colorless:clarityVS2 4.784e-02 8.579
## color_catNear Colorless:claritySI1 4.986e-02 0.570
## color_catNear Colorless:claritySI2 3.918e-02 4.731
## log(carat):cutIdeal:color_catNear Colorless 3.172e-02 1.750
## log(carat):cutVery Good:color_catNear Colorless 8.076e-02 1.027
## log(carat):cutGood:color_catNear Colorless 2.918e-02 2.022
## log(carat):cutIdeal:clarityIF 1.380e-01 -3.026
## log(carat):cutVery Good:clarityIF 1.023e-01 -2.661
## log(carat):cutGood:clarityIF 1.004e-01 -3.642
## log(carat):cutIdeal:clarityVVS1 9.877e-02 -2.089
## log(carat):cutVery Good:clarityVVS1 3.727e-02 4.971
## log(carat):cutGood:clarityVVS1 3.118e-02 -0.472
## log(carat):cutIdeal:clarityVVS2 1.342e-01 -0.566
## log(carat):cutVery Good:clarityVVS2 9.799e-02 3.654
## log(carat):cutGood:clarityVVS2 9.585e-02 1.375
## log(carat):cutIdeal:clarityVS1 9.791e-02 -1.996
## log(carat):cutVery Good:clarityVS1 3.466e-02 3.397
## log(carat):cutGood:clarityVS1 2.802e-02 -0.002
## log(carat):cutIdeal:clarityVS2 9.780e-02 -0.944
## log(carat):cutVery Good:clarityVS2 3.453e-02 6.571
## log(carat):cutGood:clarityVS2 2.785e-02 2.444
## log(carat):cutIdeal:claritySI1 9.954e-02 -2.772
## log(carat):cutVery Good:claritySI1 3.831e-02 0.768
## log(carat):cutGood:claritySI1 3.252e-02 -1.625
## log(carat):cutIdeal:claritySI2 9.408e-02 -2.117
## log(carat):cutVery Good:claritySI2 2.121e-02 4.943
## log(carat):cutGood:claritySI2 NA NA
## log(carat):color_catNear Colorless:clarityIF 1.235e-01 -3.842
## log(carat):color_catNear Colorless:clarityVVS1 7.016e-02 1.948
## log(carat):color_catNear Colorless:clarityVVS2 1.351e-01 1.547
## log(carat):color_catNear Colorless:clarityVS1 6.707e-02 -0.265
## log(carat):color_catNear Colorless:clarityVS2 6.728e-02 2.207
## log(carat):color_catNear Colorless:claritySI1 6.996e-02 -1.572
## log(carat):color_catNear Colorless:claritySI2 5.611e-02 0.234
## cutIdeal:color_catNear Colorless:clarityIF 7.604e-02 1.662
## cutVery Good:color_catNear Colorless:clarityIF 8.958e-02 2.184
## cutGood:color_catNear Colorless:clarityIF 7.276e-02 2.869
## cutIdeal:color_catNear Colorless:clarityVVS1 3.156e-02 -5.239
## cutVery Good:color_catNear Colorless:clarityVVS1 6.033e-02 -2.187
## cutGood:color_catNear Colorless:clarityVVS1 2.976e-02 -5.510
## cutIdeal:color_catNear Colorless:clarityVVS2 8.604e-02 -0.143
## cutVery Good:color_catNear Colorless:clarityVVS2 1.001e-01 0.171
## cutGood:color_catNear Colorless:clarityVVS2 8.529e-02 -0.152
## cutIdeal:color_catNear Colorless:clarityVS1 2.995e-02 -1.843
## cutVery Good:color_catNear Colorless:clarityVS1 5.940e-02 -0.260
## cutGood:color_catNear Colorless:clarityVS1 2.785e-02 -1.629
## cutIdeal:color_catNear Colorless:clarityVS2 3.011e-02 -5.479
## cutVery Good:color_catNear Colorless:clarityVS2 5.950e-02 -1.994
## cutGood:color_catNear Colorless:clarityVS2 2.804e-02 -5.744
## cutIdeal:color_catNear Colorless:claritySI1 3.462e-02 2.599
## cutVery Good:color_catNear Colorless:claritySI1 6.118e-02 1.742
## cutGood:color_catNear Colorless:claritySI1 3.153e-02 3.191
## cutIdeal:color_catNear Colorless:claritySI2 NA NA
## cutVery Good:color_catNear Colorless:claritySI2 5.281e-02 0.066
## cutGood:color_catNear Colorless:claritySI2 NA NA
## log(carat):cutIdeal:color_catNear Colorless:clarityIF 1.149e-01 2.011
## log(carat):cutVery Good:color_catNear Colorless:clarityIF 1.336e-01 3.562
## log(carat):cutGood:color_catNear Colorless:clarityIF 1.107e-01 3.812
## log(carat):cutIdeal:color_catNear Colorless:clarityVVS1 4.555e-02 -1.415
## log(carat):cutVery Good:color_catNear Colorless:clarityVVS1 8.658e-02 -1.089
## log(carat):cutGood:color_catNear Colorless:clarityVVS1 4.285e-02 -1.432
## log(carat):cutIdeal:color_catNear Colorless:clarityVVS2 1.245e-01 -0.623
## log(carat):cutVery Good:color_catNear Colorless:clarityVVS2 1.444e-01 -1.138
## log(carat):cutGood:color_catNear Colorless:clarityVVS2 1.232e-01 -0.859
## log(carat):cutIdeal:color_catNear Colorless:clarityVS1 4.093e-02 0.477
## log(carat):cutVery Good:color_catNear Colorless:clarityVS1 8.409e-02 0.552
## log(carat):cutGood:color_catNear Colorless:clarityVS1 3.756e-02 0.967
## log(carat):cutIdeal:color_catNear Colorless:clarityVS2 4.119e-02 -2.851
## log(carat):cutVery Good:color_catNear Colorless:clarityVS2 8.427e-02 -1.357
## log(carat):cutGood:color_catNear Colorless:clarityVS2 3.795e-02 -2.844
## log(carat):cutIdeal:color_catNear Colorless:claritySI1 4.719e-02 2.181
## log(carat):cutVery Good:color_catNear Colorless:claritySI1 8.645e-02 2.097
## log(carat):cutGood:color_catNear Colorless:claritySI1 4.272e-02 2.098
## log(carat):cutIdeal:color_catNear Colorless:claritySI2 NA NA
## log(carat):cutVery Good:color_catNear Colorless:claritySI2 7.566e-02 0.427
## log(carat):cutGood:color_catNear Colorless:claritySI2 NA NA
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## log(carat) < 2e-16 ***
## cutIdeal 0.001159 **
## cutVery Good 0.815031
## cutGood 0.024947 *
## color_catNear Colorless < 2e-16 ***
## clarityIF 0.955432
## clarityVVS1 7.44e-06 ***
## clarityVVS2 3.77e-07 ***
## clarityVS1 0.004893 **
## clarityVS2 0.000203 ***
## claritySI1 0.208983
## claritySI2 0.023340 *
## log(carat):cutIdeal 0.028407 *
## log(carat):cutVery Good 0.017882 *
## log(carat):cutGood 0.063308 .
## log(carat):color_catNear Colorless 0.001354 **
## cutIdeal:color_catNear Colorless 1.83e-09 ***
## cutVery Good:color_catNear Colorless 0.142248
## cutGood:color_catNear Colorless 1.91e-10 ***
## log(carat):clarityIF 0.008021 **
## log(carat):clarityVVS1 3.90e-10 ***
## log(carat):clarityVVS2 1.52e-05 ***
## log(carat):clarityVS1 8.60e-05 ***
## log(carat):clarityVS2 3.69e-14 ***
## log(carat):claritySI1 0.111993
## log(carat):claritySI2 4.72e-13 ***
## cutIdeal:clarityIF 0.493825
## cutVery Good:clarityIF 0.242545
## cutGood:clarityIF 0.499126
## cutIdeal:clarityVVS1 0.036556 *
## cutVery Good:clarityVVS1 0.818085
## cutGood:clarityVVS1 0.246597
## cutIdeal:clarityVVS2 0.046160 *
## cutVery Good:clarityVVS2 0.938850
## cutGood:clarityVVS2 0.286689
## cutIdeal:clarityVS1 0.105237
## cutVery Good:clarityVS1 0.884126
## cutGood:clarityVS1 0.478199
## cutIdeal:clarityVS2 0.036999 *
## cutVery Good:clarityVS2 0.802585
## cutGood:clarityVS2 0.233090
## cutIdeal:claritySI1 0.425560
## cutVery Good:claritySI1 0.524994
## cutGood:claritySI1 0.880523
## cutIdeal:claritySI2 0.181326
## cutVery Good:claritySI2 0.750367
## cutGood:claritySI2 0.662997
## color_catNear Colorless:clarityIF 0.029831 *
## color_catNear Colorless:clarityVVS1 < 2e-16 ***
## color_catNear Colorless:clarityVVS2 0.000253 ***
## color_catNear Colorless:clarityVS1 3.32e-08 ***
## color_catNear Colorless:clarityVS2 < 2e-16 ***
## color_catNear Colorless:claritySI1 0.568862
## color_catNear Colorless:claritySI2 2.24e-06 ***
## log(carat):cutIdeal:color_catNear Colorless 0.080182 .
## log(carat):cutVery Good:color_catNear Colorless 0.304388
## log(carat):cutGood:color_catNear Colorless 0.043190 *
## log(carat):cutIdeal:clarityIF 0.002474 **
## log(carat):cutVery Good:clarityIF 0.007787 **
## log(carat):cutGood:clarityIF 0.000271 ***
## log(carat):cutIdeal:clarityVVS1 0.036736 *
## log(carat):cutVery Good:clarityVVS1 6.66e-07 ***
## log(carat):cutGood:clarityVVS1 0.636964
## log(carat):cutIdeal:clarityVVS2 0.571563
## log(carat):cutVery Good:clarityVVS2 0.000258 ***
## log(carat):cutGood:clarityVVS2 0.169012
## log(carat):cutIdeal:clarityVS1 0.045902 *
## log(carat):cutVery Good:clarityVS1 0.000680 ***
## log(carat):cutGood:clarityVS1 0.998660
## log(carat):cutIdeal:clarityVS2 0.345050
## log(carat):cutVery Good:clarityVS2 5.00e-11 ***
## log(carat):cutGood:clarityVS2 0.014509 *
## log(carat):cutIdeal:claritySI1 0.005576 **
## log(carat):cutVery Good:claritySI1 0.442752
## log(carat):cutGood:claritySI1 0.104243
## log(carat):cutIdeal:claritySI2 0.034303 *
## log(carat):cutVery Good:claritySI2 7.68e-07 ***
## log(carat):cutGood:claritySI2 NA
## log(carat):color_catNear Colorless:clarityIF 0.000122 ***
## log(carat):color_catNear Colorless:clarityVVS1 0.051429 .
## log(carat):color_catNear Colorless:clarityVVS2 0.121747
## log(carat):color_catNear Colorless:clarityVS1 0.791079
## log(carat):color_catNear Colorless:clarityVS2 0.027330 *
## log(carat):color_catNear Colorless:claritySI1 0.115873
## log(carat):color_catNear Colorless:claritySI2 0.814907
## cutIdeal:color_catNear Colorless:clarityIF 0.096583 .
## cutVery Good:color_catNear Colorless:clarityIF 0.028936 *
## cutGood:color_catNear Colorless:clarityIF 0.004123 **
## cutIdeal:color_catNear Colorless:clarityVVS1 1.62e-07 ***
## cutVery Good:color_catNear Colorless:clarityVVS1 0.028714 *
## cutGood:color_catNear Colorless:clarityVVS1 3.59e-08 ***
## cutIdeal:color_catNear Colorless:clarityVVS2 0.886084
## cutVery Good:color_catNear Colorless:clarityVVS2 0.864057
## cutGood:color_catNear Colorless:clarityVVS2 0.879124
## cutIdeal:color_catNear Colorless:clarityVS1 0.065369 .
## cutVery Good:color_catNear Colorless:clarityVS1 0.795220
## cutGood:color_catNear Colorless:clarityVS1 0.103269
## cutIdeal:color_catNear Colorless:clarityVS2 4.27e-08 ***
## cutVery Good:color_catNear Colorless:clarityVS2 0.046165 *
## cutGood:color_catNear Colorless:clarityVS2 9.28e-09 ***
## cutIdeal:color_catNear Colorless:claritySI1 0.009358 **
## cutVery Good:color_catNear Colorless:claritySI1 0.081593 .
## cutGood:color_catNear Colorless:claritySI1 0.001420 **
## cutIdeal:color_catNear Colorless:claritySI2 NA
## cutVery Good:color_catNear Colorless:claritySI2 0.947768
## cutGood:color_catNear Colorless:claritySI2 NA
## log(carat):cutIdeal:color_catNear Colorless:clarityIF 0.044316 *
## log(carat):cutVery Good:color_catNear Colorless:clarityIF 0.000369 ***
## log(carat):cutGood:color_catNear Colorless:clarityIF 0.000138 ***
## log(carat):cutIdeal:color_catNear Colorless:clarityVVS1 0.157208
## log(carat):cutVery Good:color_catNear Colorless:clarityVVS1 0.275956
## log(carat):cutGood:color_catNear Colorless:clarityVVS1 0.152223
## log(carat):cutIdeal:color_catNear Colorless:clarityVVS2 0.533466
## log(carat):cutVery Good:color_catNear Colorless:clarityVVS2 0.255300
## log(carat):cutGood:color_catNear Colorless:clarityVVS2 0.390390
## log(carat):cutIdeal:color_catNear Colorless:clarityVS1 0.633032
## log(carat):cutVery Good:color_catNear Colorless:clarityVS1 0.580868
## log(carat):cutGood:color_catNear Colorless:clarityVS1 0.333647
## log(carat):cutIdeal:color_catNear Colorless:clarityVS2 0.004354 **
## log(carat):cutVery Good:color_catNear Colorless:clarityVS2 0.174673
## log(carat):cutGood:color_catNear Colorless:clarityVS2 0.004461 **
## log(carat):cutIdeal:color_catNear Colorless:claritySI1 0.029201 *
## log(carat):cutVery Good:color_catNear Colorless:claritySI1 0.036011 *
## log(carat):cutGood:color_catNear Colorless:claritySI1 0.035912 *
## log(carat):cutIdeal:color_catNear Colorless:claritySI2 NA
## log(carat):cutVery Good:color_catNear Colorless:claritySI2 0.669608
## log(carat):cutGood:color_catNear Colorless:claritySI2 NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1731 on 210515 degrees of freedom
## Multiple R-squared: 0.9796, Adjusted R-squared: 0.9796
## F-statistic: 8.298e+04 on 122 and 210515 DF, p-value: < 2.2e-16
If we construct a model with interaction between all variables, the model contains 122 estimated coefficients (five further terms are reported as NA), too many to discuss individually, thorough as the model is. Additionally, the summary() results show that many of the coefficients have insignificant t-values: roughly 50 of them have p-values above 0.1. However, the model’s F-statistic is very high, indicating that this model provides a better fit to the data than one with no independent variables.
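A tally like this can be read off a fitted model rather than counted by hand. The following is a minimal sketch on simulated data; the `toy` data frame, its columns, and the coefficients used to generate it are assumptions for illustration, not the Blue Nile data.

```r
# Toy data standing in for the diamond data (all values are simulated).
set.seed(1)
n <- 500
toy <- data.frame(
  carat = runif(n, 0.3, 3),
  cut   = factor(sample(c("Fair", "Good", "Ideal"), n, replace = TRUE))
)
toy$price <- exp(8 + 1.9 * log(toy$carat) + rnorm(n, sd = 0.2))

fit <- lm(log(price) ~ log(carat) * cut, data = toy)

# Coefficient table from summary(); aliased (NA) terms are dropped from it.
coef_table <- coef(summary(fit))
p_values   <- coef_table[, "Pr(>|t|)"]

n_estimated   <- nrow(coef_table)        # estimated coefficients, incl. intercept
n_aliased     <- sum(is.na(coef(fit)))   # terms that could not be estimated
n_significant <- sum(p_values < 0.1)     # coefficients with p-value below 0.1
```

Applied to the full interaction fit, the same three lines would give the number of estimated, aliased, and significant terms directly.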
We perform a partial F-test to examine the significance of the interaction terms and to determine whether to keep using the interaction model instead of the one without interaction. For this test, the null and alternative hypotheses are \(H_{0}: \beta_{i}=\beta_{i+1}=\ldots=\beta_{k}=0 \text{, and } H_{a}: \text{at least one of } \beta_{i},\beta_{i+1},\ldots,\beta_{k} \neq 0\), respectively, where \(\beta_{i},\ldots,\beta_{k}\) are the slopes for the predictors present in the full interaction model but absent from the model without interaction. We see from the ANOVA results that the calculated F-statistic is 1.6468 and the p-value for our partial F-test is 0.0007983, so we reject the null hypothesis defined above in favor of the alternative. Contextually, this means that the interaction terms are jointly statistically significant, and the evidence supports continuing to use the model with full interaction over the model with no interaction.
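In R, a partial F-test of this kind is typically run by passing two nested fits to anova(). The sketch below uses simulated data with a built-in carat-by-clarity interaction; the data frame `toy`, the slopes, and the noise level are assumptions for illustration only.

```r
set.seed(42)
n <- 1000
toy <- data.frame(
  carat   = runif(n, 0.3, 3),
  clarity = factor(sample(c("SI1", "VS1", "IF"), n, replace = TRUE))
)
# Simulated log-price whose carat slope depends on clarity (a true interaction).
slope <- unname(c(SI1 = 1.7, VS1 = 1.9, IF = 2.1)[as.character(toy$clarity)])
toy$price <- exp(8 + slope * log(toy$carat) + rnorm(n, sd = 0.2))

reduced <- lm(log(price) ~ log(carat) + clarity, data = toy)  # no interaction
full    <- lm(log(price) ~ log(carat) * clarity, data = toy)  # with interaction

# anova() on nested fits reports the partial F-statistic and its p-value.
f_test <- anova(reduced, full)
f_stat <- f_test$F[2]
p_val  <- f_test$`Pr(>F)`[2]
```

A small `p_val` means the extra interaction terms explain more variation than chance alone would, which is the decision rule applied here.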
We now introduce a more manageable version of the above model, which we will refer to as the “Reduced Interaction Model” (RIM). The clarity categories seem to introduce the most insignificant terms, so in our reduced model clarity enters only as a main effect, without interactions.
##
## Call:
## lm(formula = log(price) ~ log(carat) * cut * color_cat + clarity,
## data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.95738 -0.11268 -0.00305 0.11061 1.14939
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 9.343302 0.008637 1081.719
## log(carat) 1.946047 0.008811 220.856
## cutIdeal -0.325336 0.006906 -47.110
## cutVery Good -0.048613 0.006549 -7.423
## cutGood -0.263633 0.006564 -40.165
## color_catNear Colorless -0.300779 0.008330 -36.110
## clarityIF -0.294981 0.005999 -49.171
## clarityVVS1 -0.669646 0.005785 -115.756
## clarityVVS2 -0.814151 0.005836 -139.514
## clarityVS1 -0.502184 0.005793 -86.694
## clarityVS2 -0.560080 0.005795 -96.656
## claritySI1 -0.390352 0.005819 -67.085
## claritySI2 -0.458443 0.005816 -78.828
## log(carat):cutIdeal 0.023744 0.009556 2.485
## log(carat):cutVery Good 0.080479 0.008896 9.047
## log(carat):cutGood 0.044616 0.008958 4.980
## log(carat):color_catNear Colorless -0.151359 0.012043 -12.568
## cutIdeal:color_catNear Colorless 0.091359 0.008994 10.157
## cutVery Good:color_catNear Colorless 0.039565 0.008452 4.681
## cutGood:color_catNear Colorless 0.086033 0.008486 10.138
## log(carat):cutIdeal:color_catNear Colorless 0.040228 0.013088 3.074
## log(carat):cutVery Good:color_catNear Colorless 0.077003 0.012161 6.332
## log(carat):cutGood:color_catNear Colorless 0.044216 0.012253 3.609
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## log(carat) < 2e-16 ***
## cutIdeal < 2e-16 ***
## cutVery Good 1.15e-13 ***
## cutGood < 2e-16 ***
## color_catNear Colorless < 2e-16 ***
## clarityIF < 2e-16 ***
## clarityVVS1 < 2e-16 ***
## clarityVVS2 < 2e-16 ***
## clarityVS1 < 2e-16 ***
## clarityVS2 < 2e-16 ***
## claritySI1 < 2e-16 ***
## claritySI2 < 2e-16 ***
## log(carat):cutIdeal 0.012966 *
## log(carat):cutVery Good < 2e-16 ***
## log(carat):cutGood 6.35e-07 ***
## log(carat):color_catNear Colorless < 2e-16 ***
## cutIdeal:color_catNear Colorless < 2e-16 ***
## cutVery Good:color_catNear Colorless 2.86e-06 ***
## cutGood:color_catNear Colorless < 2e-16 ***
## log(carat):cutIdeal:color_catNear Colorless 0.002115 **
## log(carat):cutVery Good:color_catNear Colorless 2.43e-10 ***
## log(carat):cutGood:color_catNear Colorless 0.000308 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1793 on 210615 degrees of freedom
## Multiple R-squared: 0.9781, Adjusted R-squared: 0.9781
## F-statistic: 4.281e+05 on 22 and 210615 DF, p-value: < 2.2e-16
While this model has far fewer variables and would be easier to analyze, in order to maintain rigor, we perform a partial F-test to see if we can confidently accept the RIM. For this test, the null and alternative hypotheses are \(H_{0}: \beta_{j}=\beta_{j+1}=\ldots=\beta_{k}=0 \text{, and } H_{a}: \text{at least one of } \beta_{j},\beta_{j+1},\ldots,\beta_{k} \neq 0\), respectively, where \(\beta_{j},\ldots,\beta_{k}\) are the slopes for the predictors present in the full interaction model, but absent from the RIM.
We see from the ANOVA results that the calculated F-statistic is 1.6645 and the p-value for our partial F-test is 0.001443, so we reject the null hypothesis defined above in favor of the alternative. Contextually, this means that the terms dropped from the RIM are jointly statistically significant, and the evidence supports continuing to use the model with full interaction over the model with reduced interaction.
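For reference, the partial F-statistic has the standard textbook form below. The symbols are notation introduced here rather than taken from the report: \(SSE_{R}\) and \(SSE_{F}\) are the residual sums of squares of the reduced and full models, and \(df_{R}\) and \(df_{F}\) are their residual degrees of freedom.

\[
F = \frac{(SSE_{R} - SSE_{F})/(df_{R} - df_{F})}{SSE_{F}/df_{F}}
\]

Under \(H_{0}\), this statistic follows an F distribution with \(df_{R} - df_{F}\) and \(df_{F}\) degrees of freedom. For the comparison of the RIM with the full interaction model, the summaries report 210615 and 210515 residual degrees of freedom, so the numerator has \(210615 - 210515 = 100\) degrees of freedom.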
We now check the assumptions of linearity and normality for the model. The residual plot and the autocorrelation function (ACF) of the residuals show no issues.
The normal QQ plot deviates from the reference line at high values, further indicating that Blue Nile’s pricing methodology at very high price points differs from its methodology at lower ones.
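The three diagnostics mentioned above can be produced with base R graphics. This is a sketch on simulated data, not the report’s own code; the `toy` data and the single-predictor model are assumptions, and `pdf(NULL)` is only there so the example runs without opening a plot window.

```r
set.seed(7)
n <- 300
toy <- data.frame(carat = runif(n, 0.3, 3))
toy$price <- exp(8 + 1.9 * log(toy$carat) + rnorm(n, sd = 0.2))

fit <- lm(log(price) ~ log(carat), data = toy)
res <- resid(fit)

pdf(NULL)  # discard plots when run non-interactively; remove to view them

# Residuals vs fitted values: look for curvature or a funnel shape.
plot(fitted(fit), res, xlab = "Fitted log(price)", ylab = "Residual")
abline(h = 0, lty = 2)

# Autocorrelation of residuals: spikes beyond the bands suggest dependence.
acf_out <- acf(res, main = "ACF of residuals")

# Normal QQ plot: points leaving the line in the tails suggest non-normality.
qqnorm(res)
qqline(res)

invisible(dev.off())
```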
Performing Levene’s Test on each categorical predictor, we fail to reject the null hypothesis of equal variances, so we find no evidence against homoscedasticity across the classes. The Levene’s Test results below are for cut, color, and clarity, respectively.
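The report does not say which implementation of Levene’s Test was used. As a base R sketch, the median-centered variant (often called Brown-Forsythe) is a one-way ANOVA on absolute deviations from each group’s median. The helper name `levene_median` and the simulated inputs are hypothetical.

```r
# Median-centered Levene test: one-way ANOVA on absolute deviations
# from each group's median.
levene_median <- function(values, groups) {
  groups <- factor(groups)
  z <- abs(values - ave(values, groups, FUN = median))
  tab <- anova(lm(z ~ groups))
  c(F = tab$`F value`[1], p = tab$`Pr(>F)`[1])
}

set.seed(3)
groups <- rep(c("Good", "Very Good", "Ideal"), each = 200)

# Simulated residuals: one set with equal spread, one where "Ideal" is wider.
equal_var   <- rnorm(600, sd = 1)
unequal_var <- rnorm(600, sd = rep(c(1, 1, 5), each = 200))

res_equal   <- levene_median(equal_var, groups)
res_unequal <- levene_median(unequal_var, groups)
```

A large p-value, as reported for cut, color, and clarity, means the test found no evidence of unequal variances; it does not prove the variances are equal.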
We now use the model to create prediction intervals for two diamonds randomly selected from the dataset.
## Warning in predict.lm(result_tot, predict1, level = 0.95, interval =
## "prediction"): prediction from a rank-deficient fit may be misleading
## fit lwr upr
## 1 12.40206 12.06069 12.74343
## Warning in predict.lm(result_tot, predict2, level = 0.95, interval =
## "prediction"): prediction from a rank-deficient fit may be misleading
## fit lwr upr
## 1 8.724493 8.385221 9.063765
The actual values for the selected diamonds fall within the prediction intervals, which is consistent with the model functioning well. Note that fit, lwr, and upr are on the log(price) scale.
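Because the response is log(price), the intervals can be converted to dollars by exponentiating. The sketch below copies the six numbers from the predict() output above; the row labels `diamond1` and `diamond2` are names introduced here, and natural logarithms are assumed, as R’s log() uses by default.

```r
# Intervals copied from the predict() output (log-price scale).
log_intervals <- rbind(
  diamond1 = c(fit = 12.40206, lwr = 12.06069, upr = 12.74343),
  diamond2 = c(fit = 8.724493, lwr = 8.385221, upr = 9.063765)
)

# exp() undoes the natural log applied to price.
dollar_intervals <- exp(log_intervals)
round(dollar_intervals)

# Upper bound divided by lower bound: the relative width of each interval.
width_ratio <- dollar_intervals[, "upr"] / dollar_intervals[, "lwr"]
width_ratio
```

For both diamonds the upper bound is roughly twice the lower bound, so the margin of error is similar in relative terms but much larger in dollars for the expensive stone.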
Our full multiple regression model performed best, accounting for 98% of the variability in price. How might this model be improved? Since the data for expensive diamonds follow a noticeably different pattern, it may help to split the data piecewise and fit a separate model for each price tier. This may require collecting additional data to obtain a good fit. It may also be helpful to add variables for natural vs. lab-grown and conflict-free vs. blood diamonds.
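The piecewise idea could be sketched as follows. This is only an illustration on simulated data: the `toy` data frame, the 95th-percentile cutoff, and the tier labels are assumptions, not choices made in the analysis.

```r
set.seed(11)
n <- 2000
toy <- data.frame(carat = runif(n, 0.3, 5))
toy$price <- exp(8 + 1.9 * log(toy$carat) + rnorm(n, sd = 0.2))

# Hypothetical cutoff: the top 5% of prices form the "high" tier.
cutoff <- quantile(toy$price, 0.95)
toy$tier <- ifelse(toy$price > cutoff, "high", "standard")

# Fit one model per tier.
tier_fits <- lapply(split(toy, toy$tier), function(d) {
  lm(log(price) ~ log(carat), data = d)
})

# Residual standard error of each tier's model, for comparison.
tier_sigma <- sapply(tier_fits, function(m) summary(m)$sigma)
```

One practical caveat: a new diamond’s price is unknown before prediction, so a split on a predictor such as carat may be easier to apply than a split on price itself.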
[1] Pup, W. (28 July 2020). The Real Price of a Diamond. 2FI. http://2fi.ie/real-price-diamond/
[2] (28 July 2020). Diamond Quality Factors. GIA. https://www.gia.edu/diamond-quality-factor